In this document, we would like to use an example project to show how we can write reports in RStudio using R Markdown.
We use an example gene expression data set which includes gene expression values (n = 11900) for 10 samples, covering two conditions (Control and TGFb-treated). This data is a subset of an integrated data set from this paper. We also have a expression signatures, called TGFb-EMT signature generated in the same paper. Both the data subset and the signature are available from the singscore R/Bioconductor package.
The purpose of this project is to find samples that are more concordant with the TGFb-EMT signature. To do this, we need to use a gene-set scoring method and samples’ transcriptional profiles to score samples against this signature, and then compare their scores.
To score samples, we use the singscore method, which is available as an R/Bioconductor package. If you are interetsed in the method, you can check the workflow paper by Bhuva et al, Using singscore to predict mutations in acute myeloid leukemia from transcriptomic signatures.
library(knitr)
library(DT)
library(plotly)
library(singscore)
library(SummarizedExperiment)
library(GSEABase)
The example data available through singscore package is called tgfb_expr_10_se and is a SummarizedExperiment object. This object can include three main components: assay (e.g. the gene expression matrix), rowData (e.g. data frame for gene information), and colData (e.g. data frame for sample information).
Looking at the head of the expression data, we see that samples are in columns, and each row represents expression of a gene, with Entrez IDs as row names.
datatable(assay(tgfb_expr_10_se)[1:5, 1:4])
We would like to add more annotation file to the samples; we add these to the colData slot of the tgfb_expr_10_se object.
tgfb_expr_10_se@colData$Replicate[grepl("R1", rownames(tgfb_expr_10_se@colData))] <- "Replicate 1"
tgfb_expr_10_se@colData$Replicate[grepl("R2", rownames(tgfb_expr_10_se@colData))] <- "Replicate 2"
tgfb_expr_10_se@colData$Study[grepl("D_", rownames(tgfb_expr_10_se@colData))] <- "Deshieri"
tgfb_expr_10_se@colData$Study[grepl("Hes_", rownames(tgfb_expr_10_se@colData))] <- "Hesling"
tgfb_expr_10_se@colData$Study[grepl("Hil_", rownames(tgfb_expr_10_se@colData))] <- "Hill"
We generate an interactive table using DT package.
DT::datatable(data.frame(colData(tgfb_expr_10_se)), filter = "top")
datatable(data.frame(colData(tgfb_expr_10_se))) %>%
formatStyle("Treatment", backgroundColor = styleEqual(unique(colData(tgfb_expr_10_se)$Treatment), c("lightseagreen", "orange")))
The TGFb-EMT gene expression signature has up-regulated gene set as well as down-regulated gene-set. A sample with high evidence of TGFb-EMT signature shows high expression of the up-regulated genes and low expression of the down-regulated genes compared to samples with low evidence of this signature. Looking at the up- and down- regulated genes, we see that they are Entrez IDs.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
## head of gene sets
head(geneIds(tgfb_gs_up))
## [1] "19" "87" "182" "241" "323" "406"
head(geneIds(tgfb_gs_dn))
## [1] "136" "220" "224" "288" "330" "360"
We need to rank the expression data first.
rankedData <- rankGenes(expreMatrix = assay(tgfb_expr_10_se))
datatable(head(rankedData))
sigScore_tgfb <-
simpleScore(
rankData = rankedData,
upSet = tgfb_gs_up,
downSet = tgfb_gs_dn,
subSamples = NULL,
centerScore = T,
dispersionFun = "MAD",
knownDirection = T
)
datatable(sigScore_tgfb, filter = "top")
We visualise scores in samples with the highest and lowest scores.
plotDispersion(scoredf = sigScore_tgfb,
annot = colData(tgfb_expr_10_se)$Treatment,
annot_name = "Group",
isInteractive = T, textSize = 0.5)
plotRankDensity(
rankData = rankedData[, "D_TGFb_R1", drop = F],
upSet = tgfb_gs_up,
downSet = tgfb_gs_dn,
isInteractive = T
)
plotRankDensity(
rankData = rankedData[, "D_Ctrl_R2", drop = F],
upSet = tgfb_gs_up,
downSet = tgfb_gs_dn,
isInteractive = T
)
Read EMT signature from Thiery et al .
emt <- read.table("../data/Thiery_EMTsignature_both_tumour_cellLine_EntrezIDs.txt", header = T, sep = "\t")
epi <- emt$EntrezGene.ID[emt$epiMes_cellLine == "epi" ]
mes <- emt$EntrezGene.ID[emt$epiMes_cellLine == "mes" ]
epi <- epi[complete.cases(epi)]
mes <- mes[complete.cases(mes)]
sigScore_epi <-
simpleScore(
rankData = rankedData,
upSet = epi,
subSamples = NULL,
centerScore = T,
dispersionFun = "MAD",
knownDirection = T
)
sigScore_mes <-
simpleScore(
rankData = rankedData,
upSet = mes,
subSamples = NULL,
centerScore = T,
dispersionFun = "MAD",
knownDirection = T
)
p1 <- plotScoreLandscape(
scoredf1 = sigScore_epi,
scoredf2 = sigScore_mes,
scorenames = c("Epithelial", "Mesenchymal")
)
projectScoreLandscape(
plotObj = p1,
scoredf1 = sigScore_epi,
scoredf2 = sigScore_mes,
annot = colData(tgfb_expr_10_se)$Treatment,
annot_name = "Group",
sampleLabels = colnames(rankedData),
isInteractive = T
)
(#fig:EMT landscape)Epithelial-Mesenchymal landscape from Cursons et al paper published in Cell Systems
TCGA_EMTscore <- readRDS('../data/TCGA_BrCr_EMT_score.RDS')
p2 <- plotScoreLandscape(
scoredf1 = TCGA_EMTscore$TCGA_BrCr_Epi,
scoredf2 = TCGA_EMTscore$TCGA_BrCr_Mes,
scorenames = c("Epithelial", "Mesenchymal")
)
projectScoreLandscape(
plotObj = p2,
scoredf1 = sigScore_epi[1:4, ],
scoredf2 = sigScore_mes[1:4, ],
annot = colData(tgfb_expr_10_se)$Treatment[1:4],
annot_name = "Group",
sampleLabels = colnames(rankedData)[1:4],
isInteractive = T
)
p <- plot_ly(sigScore_epi, x = ~ TotalScore, y = ~ TotalDispersion,
color = ~ colData(tgfb_expr_10_se)$Treatment)
p
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [3] BiocParallel_1.18.0 matrixStats_0.54.0
## [5] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
## [7] singscore_1.4.0 GSEABase_1.46.0
## [9] graph_1.62.0 annotate_1.62.0
## [11] XML_3.98-1.20 AnnotationDbi_1.46.0
## [13] IRanges_2.18.1 S4Vectors_0.22.0
## [15] Biobase_2.44.0 BiocGenerics_0.30.0
## [17] plotly_4.9.0 ggplot2_3.2.0
## [19] DT_0.7 knitr_1.23
## [21] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.0 edgeR_3.26.5 tidyr_0.8.3
## [4] bit64_0.9-7 jsonlite_1.6 viridisLite_0.3.0
## [7] shiny_1.3.2 assertthat_0.2.1 highr_0.8
## [10] BiocManager_1.30.4 blob_1.1.1 GenomeInfoDbData_1.2.1
## [13] yaml_2.2.0 pillar_1.4.2 RSQLite_2.1.1
## [16] lattice_0.20-38 glue_1.3.1 limma_3.40.2
## [19] digest_0.6.19 RColorBrewer_1.1-2 promises_1.0.1
## [22] XVector_0.24.0 colorspace_1.4-1 plyr_1.8.4
## [25] htmltools_0.3.6 httpuv_1.5.1 Matrix_1.2-17
## [28] pkgconfig_2.0.2 bookdown_0.11 zlibbioc_1.30.0
## [31] purrr_0.3.2 xtable_1.8-4 scales_1.0.0
## [34] later_0.8.0 tibble_2.1.3 withr_2.1.2
## [37] hexbin_1.27.3 lazyeval_0.2.2 mime_0.7
## [40] magrittr_1.5 crayon_1.3.4 memoise_1.1.0
## [43] evaluate_0.14 tools_3.6.0 data.table_1.12.2
## [46] stringr_1.4.0 munsell_0.5.0 locfit_1.5-9.1
## [49] compiler_3.6.0 rlang_0.4.0 grid_3.6.0
## [52] RCurl_1.95-4.12 htmlwidgets_1.3 crosstalk_1.0.0
## [55] labeling_0.3 bitops_1.0-6 rmarkdown_1.13
## [58] gtable_0.3.0 DBI_1.0.0 reshape2_1.4.3
## [61] R6_2.4.0 dplyr_0.8.2 bit_1.1-14
## [64] stringi_1.4.3 Rcpp_1.0.1 tidyselect_0.2.5
## [67] xfun_0.8